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Abstract 

Four totally parallel algorithms for the solution of a sparse linear system have common characteristics 
which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. 
These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, 
Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. 
In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, 
which we will refer to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized 
to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that 
a single iteration per timestep keeps the local error within the required tolerance. 
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1. Introduction 

The multilevel algorithms we are discussing are all designed to solve large and highly 
parallel problems on massively parallel computers in as short a time as possible. We say 
that a problem is highly parallel if the defining equations can be applied efficiently to a 
proposed solution on a massively parallel computer. The operative definition of the phrase 
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massively parallel computer may well change with time as increasingly parallel machines 
are built. We expect, however, that the Connection Machine [1] is sufficiently parallel that 
the algorithms we demonstrate on it now will be effective on the still larger machines that 
we will see in the near future. Because parallel supercomputers are now becoming available 
it is reasonable to begin the development of algorithms capable of solving those very large 
problems which have, in the past, been too large and complex to consider seriously. 

We describe the general algorithm TPMA (Totally Parallel Multilevel Algorithm) in 
section three, after establishing some notation in section two. In sections four through 
seven we show that this algorithm has as special cases the algorithms PSMG (Parallel 
Superconvergent Multigrid) of Frederickson and McBryan [2,3], RMG (robust Multigrid) 
of Hackbusch [4], the Fast Fourier Transform based Spectral Method [5,6], and Parallel 
Cyclic Reduction [7-9]. It seems plausible that the common framework we establish will 
lead to a clearer understanding of the functioning of all of these algorithms, and aid in the 
development of improved algorithms through hybridization of the old. 

To put the algorithm TPMA in a proper perspective we will first describe a class of 
problems for which one version or another of this general algorithm appears to be optimal, 
depending on the circumstances. We will consider, for this purpose, a class of problems that 
arise in computational models of three dimensional fluid flow based on partial differential 
equations. Numerical solution of the generic problem in this class often requires repeated 
solution of an equation of the form 


H <• ) = », (ii) 

where J- denotes a system of nonlinear differential operators, v is a collection of vector 
and scalar fields representing the problem data, and u is the desired solution, also a 
collection of vector and scalar fields. There are no direct algorithms for solving problems 
of this class. Thus we will need to develop a sequence u n of approximate solutions which 
define u as their limit. All known methods for the construction of a convergent sequence 
of approximate solutions to a nonlinear problem of this sort require the evaluation of a 
residual 
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r n = 

at each step, followed by a defect correction step in which tt n +i is specified by constructing 
the increment ti n+ i — u n as a function of the residual r n . 


2. Approximate Inverse Operators. The typical algorithm for the construction 
of u n +i — u n as a function of r n is linear in r n . It is an algorithmic representation of 
a linear operator B : y — ► X, and we are able to write 

u n + 1 - u n = B r n . (2.1) 

Since any matrix representation of the restriction of B to finite dimensioned subspaces of 
X and y would only be of theoretical interest for problems of the size that interest us 
here, we will restrict our attention to algorithmic representations B of B. 

For the best of the algorithms we are able to provide an error bound of the form 

|| r n - A ( u n+1 - tin ) || < e|| r n ||, (2.2) 

in which the operator A approximates the linearization of T about ti n . It follows from 
(2.1) and (2.2) that the operators A and B are related by 

|| I - AB\\ < e. (2.3) 

We will refer to B as an c- approximate inverse of the operator A if it satisfies (2.3). 
Our goal is to construct an approximate inverse operator which is highly parallel, has low 
operation count, and is highly accurate. The algorithm TPM A that we describe below 
appears to meet these needs in many situations of interest. 

- 3 - 



3: The Totally Parallel Multilevel Algorithm TPMA 

Multilevel algorithms for the solution of a large sparse linear system are characterized 
by their effective use of solutions to a hierarchy of smaller and simpler linear systems to 
construct a solution to the given system. The general multilevel algorithm for the solution 
of a problem at level k in the hierarchy, which we will denote by B k , involves the following 
four steps: 

Projection. The problem data is projected onto one or more smaller systems at 
level k — 1 using a linear operator that we will denote P k . 

Subsystem Solution. Each of these smaller systems is solved using the algo- 
rithmic linear operator B k ~ i . 

Interpolation. These solutions are combined, using a generalized interpolation 
operator Q k , into an approximate solution at level k. 

Smoothing. A smoothing operator S fc , usually an approximate inverse in its 
own right, is applied to the resulting residual at level k. 

Since each of P k , Q k and S k is linear, we are able to describe the algorithm B k for 
k > 0 with the equation 

B k = S k + (I - S k A k )Q k B k ~ 1 P k (3.1) 

while at the lowest level B° = (A 0 ) -1 . In some cases, as we shall see, the algorithm 
TPMA is exact with S k = 0, in which case (3.1) reduces to 

B k = Q k B k ~ 1 P k 1 (3.1') 

which may be compactly represented by the commutative diagram 
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(3.2) 


x k y k 

}q‘ jp* 

Tjfc— 1 

x k ~ 1 *— y k ~ l 

In other cases either P k or Q k is the identity operator, further simplifying the computa- 
tion. The recursive definition of B k is closely related to the recursive definition of the 
operator A k at level k using the Galerkin projection A k ~ l = P k A* Q*, illustrated in 
the commutative diagram 

x k ^ y k 

jV [p k (3.3) 

X k-\ ^ yk - 1 

This definition of the operator A k is used in most versions of the algorithm TPMA, 
although we will see exceptions. 

4: The algorithm PSMG 

The algorithm PSMG (Parallel Superconvergent Multigrid) is our primary example of 
a totally parallel multilevel algorithm. For clarity we consider first a rather simple problem, 
namely the two dimensional Laplacian on a rectangle with periodic boundary conditions. 
The fact that this problem is singular will cause us no difficulty. In this case the operators 
commute, being convolution operators, and we are able to afford a significant saving in 
storage by taking 


P k = I, 0 < Jfe < /. (4.1) 

As a result of this simplification we do not need to store the projections of the problem 
data, but always use the data at the highest level. Moreover, we are able to write u k over 
tt* -1 , so the solution process requires only two arrays in all (in addition to the temporary 
variables introduced by the compiler). The savings is almost a factor of log(n ), which 
allows us to solve much larger problems. 
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In our implementation on the Connection Machine we have chosen a data structure for the 
arrays r k and u k which leaves them superimposed on the arrays for the higher and lower 
levels. This means that the operators at level k are scaled, when k < l , to operate on 
grid points a distance V apart, where j — l — k. The coefficients are invariant from 
one level to the next, in this simple case, but the scales on which they operate are not. 

We have found that there is an advantage to a larger operator Q, which now takes 
on the role of P as well. We have obtained very good results with 

-.005464 .013850 -.017536 .013850 -.005464' 

.013850 .062500 .097300 .062500 .013850 

-.017536 .097300 .341997 .097300 -.017536 (4.5) 

.013850 .062500 .097300 .062500 .013850 

-.005464 .013850 -.017536 .013850 -.005464. 

and the smoothing operator Z given by 

.01566 .04649 .01566* 

Z = .04649 .3059 .04649 (4.5) 

.01566 .04649 .01566 _ 

The algorithm PSMG was developed by Oliver McBryan and the author [3,4] after 
McBryan had implemented a standard multigrid algorithm on the Connection Machine and 
observed good convergence but inefficient operation: most of the processors were idle most 
of the time. We asked ourselves how these idle processors could be used most efficiently in 
furthering the computation. For further motivation behind the multigrid method see the 
early paper of Brandt [10]. The recent book of Hackbusch [11] provides an excellent review 
of current multigrid algorithms, and puts their convergence analysis in a firm theoretical 
framework, while a variety of parallel implementations are discussed in references [12-16]. 

5: Hackbusch’s Robust Multi-Grid algorithm RMG 

In the development of PSMG it was assumed that the approximate problems at the 
next lower level would be nearly identical, differing only in their interwoven relationship 
to the finer grid. Hackbusch observes [4] that there are advantages, particularly when the 
given problem is highly variable in space, to using different projection operator » for each of 
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these lower level approximations. Since the algorithm has a wider domain of convergence 
(even though it may not converge as rapidly for nice problems) he refers to it as RMG, for 
Robust Multi- Grid. 

Hackbusch specifies the four projection operators Po,o, Po,i, Pi,o and P\ y \ in a two- 
dimensional example of PSMG (he uses the notation ) to be the duals of the four 
interpolation operators <?o,o» $ 0 , 1 , $ 1,0 and Qi,i (for which his notation is ) given by 
the equations 
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( 5 . 1 ) 


It is clear that this may be implemented with essentially the same software that we have 


used in implementing PSMG, with a short preprocessing step in which the signs 


of some 


of the coefficients are changed. 


6: Spectral Methods based on the FFT 

The classical spectral algorithm for the solution of Poisson’s problem on a square or 
cube has three steps: take the FFT of the problem data v , divide this by the transform 
of the Laplacian (or some other constant coefficient differential operator), and transform 
back. We may, however, equally well describe it as an example of the algorithm TPMA in 
which the operator P is given, in two dimensions, by 
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It is clear that this may be implemented with essentially the same TPMA software that we 
have used in implementing PSMG or RML, which helps to clarify the strong theoretical 
relationship between these fast algorithms. It is often advantageous, however, to implement 
these operators one dimension at a time, in effect first doing 


ft = .00 


1.0 u>* io ) Pi = -p( 1.0 -a; fc io .00) (6.2) 

\/2 


followed by pair of operators transverse to these. Actually, there may be a strong compu- 
tational advantage in not multiplying by the indicated zero, for on the connection machine 
this means that this data does not need to be communicated. As we will see later, it is 
possible to arrange the communication on a hypercube, and the CM2 in particular, in 
such a way that the communication step takes just half as long. There may also be an 
advantage to keeping the factor on the cuff until the end of the computation, and 
divide once by 2 l . 

An important observation is that the Fourier multipliers u>k l need be computed only 
once, on the highest level. Those coefficients needed at the next lower level are a Hamming 
distance at most two away, and are moved in at the start of the computation at that level. 


7: Parallel Cyclic Reduction Algorithms. 

There are important applications which require the solution of a separable elliptic 
equation on a rectangle, or the image of a rectangle under a separable coordinate trans- 
formation. For many years the algorithm of choice, when the circumstances are right, 
has been one of the generalizations of the Cyclic Reduction algorithm of Hockney [17] and 
Buneman [18]. Notable among these generalizations are the Fast Poisson Solver of Buzbee, 
Golub and Nielsen [19], the algorithm FACR of Hockney [20], and the FISHPAK routines 
of Swartztrauber and Sweet [21-22]. Jespersen and Levit [23] have recently implemented 
a Navier-Stokes solver on the connection machine at Ames Research Center in which they 
use a parallel algorithm developed by Levit and based on the the work of Hockney and 
Jessope [7]. For further discussions of parallel generalizations see Johnsson [24] and Ortega 
[25]. 
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Consider first the classic cyclic reduction algorithm. It is easily described as Gaussian 
elimination in an unusual order: the even numbered rows are eliminated first, then half 
the remaining rows, and so on until log(n) steps later only one block remains. (Thus the 
alternate name odd even elimination for the algorithm.) The parallel cyclic reduction 
algorithm of Hockney and Jessope [7] stems from the observation that when one has 
sufficiently many processors it is efficient to apply this transformation in parallel on all 
the rows for all log(n) steps of the reduction process at which point one need only do 
one (parallel) solve. When the parallel cyclic reduction algorithm is applied to a block 
tridiagonal system in which the off-diagonal blocks are identity operators it leads to a 
sequence of block tri- diagonal systems of the form 

{A k u k )i = u\_ 2 , + a k u k i + u k i+2 , = v k i (7.1) 

with the right hand sides defined recursively by 

= P k v k i = v k i_ 2 > - a k v k i + v k i+2 > , (7.2) 

and in which the diagonal blocks a k are given by 

a* -1 = 21 - a k a k . (7.3) 

Observe that the operators A k defined by (7.1) and (7.2) and the projection operators 
P k defined by (7.2) and (7.3) are related by 

A 1 ’- 1 = P k A k , (7.4) 

which is (3.3) with Q k = I . This is another way of saying that the interpolation 
operators are not needed in this version of the algorithm TPMA. 
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8: Information Exchange on the CM2 

All multilevel algorithms require long distance communication, for it is by spreading 
problem information and solution information over long distances that they obtain their 
computational advantage. All of the totally parallel multilevel algorithms that we have 
discussed here have one further characteristic in common: the distances over which they 
communicate is always a power of two. This gives parallel computers with a hypercube 
architecture a considerable advantage, for on hypercubes it is possible to send a message 
a grid distance 2* in only two hops, provided the grid is laid out according to the binary 
reflected gray code BRGC (or at least a gray code which is a bit-permutation of BRGC). 

Moreover, if the computer is built to send information on two lines simultaneously, 
then there is a communication scheme with the property that a distance 2 fc exchange takes 
no longer, in principle, than a distance one exchange. Moreover, if information can be 
sent in both directions simultaneously on any line an exchange will be twice as fast as 
use of either get or send. We note that the CM2 satisfies the first property, and timing 
experiments indicate that the second property is close to true. 



Figure 2. Communication lines active during phase 0 of a distance 16 exchange 
are shown in (a), and those active during phase 1 of the exchange are shown in 
(b). 
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The exchange of information proceeds in two phases, during each of which half the 
processors are communicating with distant neighbors to their left (or below them) in grid 
ordering and half are communicating with neighbors to their right. As is clear from Fig. 3, 
one communication line is active during both phases, and is therefore a bottleneck which 
prevents a faster communication scheme. This is communication line k-1 for distance 2* 
communication. (When a higher dimensional grid is in use we consider only one dimension 
at a time, and refer to the lowest order line dedicated to that dimension as line 0.) 

During phase 0 of the exchange line k is also active on all processors, transferring 
information upwards for half of the processors and downward for the other half. Although 
it is not clear from the figure, the set of communication lines active during phase 1 is 
not the same on all processors. If processor i is exchanging with processor j during 
this phase, the other line in use is the index of the highest nonzero bit in i XOR j . 
The three communication lines active during this exchange are, therefore, k — 1, k and 
[ Ig ( i X OR j ) J . At this point we should observe that the communication required by 
the FFT is provided by phase 0 alone, and thus requires only half the time if it is arranged 
in this manner, as the communication required by the other versions of TPMA we have 
discussed. 
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